Independent checkpointing allows maximum process autonomy but suffers from potential domino
effects. Coordinated checkpointing eliminates the domino effect by sacrificing a certain degree of
process autonomy. In this paper, we propose the technique of lazy checkpoint coordination, which
preserves process autonomy while employing communication-induced checkpoint coordination for
bounding rollback propagation. The introduction of the notion of laziness allows a flexible
trade-off between the cost for checkpoint coordination and the average rollback distance. Worst-case
overhead analysis provides a means for estimating the extra checkpoint overhead. Communication
trace-driven simulation for several parallel programs is used to evaluate the benefits of the proposed
scheme for real applications.
1 Introduction


Independent (or uncoordinated) checkpointing [1-3] for parallel and distributed systems allows
maximum process autonomy and independent design of recovery capability for each process.
However, since the rollback of a message sender requires the sympathetic rollback [4] of the receiver,
the domino effect [5] is in general possible unless certain mechanisms are incorporated into the
checkpointing and recovery protocol to guarantee recovery line [6] progression. Existing techniques
for achieving domino-free rollback recovery can be classified into two primary categories [7]. The
first category can be called the minimum sympathetic rollback approach, in which either the rollback
of a process will never undo any messages sent or the receiver of an undone message M will try to
roll back to the state immediately before receiving M. Wu and Fuchs [8] insert a checkpoint
immediately after each message is sent so that no sympathetic rollback is necessary for any failure. Kim
et al. [9,10] and Venkatesh et al. [11] employ dependency tracking and insert extra checkpoints
before processing any messages that result in new dependency. The state-interval based approach
[12-21] models the program execution as consisting of a number of state intervals, each started by
processing a new message. Message logging in addition to checkpointing is employed to effectively
insert a "checkpoint" (in the optimized form of a message log) before each message receipt.

The second category can be called the bounded rollback propagation approach. Corresponding
checkpoints (based on the ordinal numbers) on different processes are required to coordinate with
each other in order to form a recovery line to bound the possible rollback propagation. Usually,
whenever a checkpoint is initiated by one process, all other processes are informed and required
to take appropriate checkpoints to guarantee the resulting set of checkpoints is consistent [22-27].
The number of processes required to participate in each checkpointing session can be reduced by
monitoring the recent message exchanging history [28]. For systems with clock synchronization
and/or bounded message transmission delay, the cost for checkpoint coordination can be further
reduced [29-32].

We will use the term eager checkpoint coordination for the coordination action performed when
checkpoints are initiated (as described above). In contrast, processes in a system with lazy
checkpoint coordination only coordinate their corresponding checkpoints when the message
communication indicates a violation of checkpoint consistency^1. Briatico et al. [35] force the receiver of a
message M to take a checkpoint before processing M if the sender's checkpoint ordinal number
tagged on M is greater than that of the receiver. Checkpoints with the same ordinal numbers
are therefore always guaranteed to be consistent. However, the run-time overhead may be
prohibitively high due to the possibly excessive number of extra induced checkpoints. In this paper,
we generalize the concept of communication-induced checkpoint coordination by introducing the
notion of laziness $Z$ as a measure of the frequency for performing coordination. Only corresponding
checkpoints with ordinal numbers $nZ$, where $n$ is an integer, are required to be consistent with each
other and form the recovery line for bounding rollback propagation. Overhead analysis and
experimental evaluation show that lazy checkpoint coordination can significantly reduce the number of
extra checkpoints and offer a flexible trade-off between run-time overhead and average rollback
distance.

The paper is organized as follows. Section 2 describes the system model and the checkpointing
and recovery protocol; Section 3 gives the motivation and the algorithm for lazy checkpoint
coordination; worst-case overhead analysis is presented in Section 4; and the trace-driven simulation
results for several parallel programs are discussed in Section 5.


2  Checkpointing  and  Rollback  Recovery 


The system considered in this paper consists of a number of concurrent processes for which
all process communication is through message passing. Processes are assumed to run on fail-stop
processors [36] and each processor is considered as an individual recovery unit [15]. We do not
assume the piecewise deterministic execution model [20].

^1 The basic idea motivating lazy checkpoint coordination is similar to the concepts behind lazy release
consistency in distributed shared memory [33] and lazy message cancellation in optimistic distributed simulation
systems [34].

During normal execution, the state of each processor is periodically saved as a checkpoint on
stable storage. Let $CP_{i,k}$ denote the $k$th checkpoint of processor $p_i$ with $k \ge 0$ and $0 \le i \le N-1$,
where $N$ is the number of processors. A checkpoint interval is defined to be the time between
two consecutive checkpoints on the same processor, and the interval between $CP_{i,k}$ and $CP_{i,k+1}$
is called the $k$th checkpoint interval. Each message is tagged with the current checkpoint ordinal
number and the processor number of the sender. Each processor takes its checkpoint independently
and updates the direct dependency information table (or input table [2]) as follows: if at least one
message from the $m$th checkpoint interval of processor $p_j$ has been processed during the previous
checkpoint interval, the pair $(j, m)$ is added to the table entry for the new checkpoint.
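The tagging and table-update rule just described can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the class `Processor`, its fields `pending` and `input_table`, and the message tuple layout are hypothetical names chosen for illustration and do not come from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class Processor:
    """Toy model of one processor's direct dependency tracking (illustrative names)."""
    pid: int
    ordinal: int = 0  # ordinal number of the latest checkpoint; CP_{i,0} is the initial one
    pending: set = field(default_factory=set)        # (j, m) pairs seen in the current interval
    input_table: dict = field(default_factory=dict)  # checkpoint ordinal -> set of (j, m)

    def send(self, payload):
        # Each message is tagged with the sender's processor number and current ordinal.
        return (self.pid, self.ordinal, payload)

    def process(self, message):
        sender, interval, _payload = message
        # Remember that a message from interval `interval` of `sender` was processed.
        self.pending.add((sender, interval))

    def take_checkpoint(self):
        # The new checkpoint's table entry lists every (j, m) processed since the last one.
        self.ordinal += 1
        self.input_table[self.ordinal] = self.pending
        self.pending = set()
        return self.ordinal
```

In this sketch, a message sent by processor 0 before it takes its first checkpoint carries the tag (0, 0); if processor 1 processes it and then checkpoints, the entry for its checkpoint 1 contains the pair (0, 0).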

A centralized garbage collection algorithm [37] can be periodically invoked by any processor.
First, the dependency information for all existing checkpoints is collected to construct the checkpoint
graph [1] (Fig. 1(b)). All checkpoints corresponding to the vertices marked "X" in Fig. 1(b) are
determined to be garbage by the algorithm and can therefore be discarded.

When processor $p_i$ initiates a rollback, it sends out a rollback-initiating message [2] to
every other processor to request the up-to-date dependency information. Each surviving processor
takes a virtual checkpoint (represented by the dotted vertex in Fig. 1(c)) upon receiving the
rollback-initiating message. After receiving the responses, $p_i$ constructs the extended checkpoint graph
[1] and executes the rollback propagation algorithm shown in Fig. 2 to determine the recovery line
(the shaded vertices in Fig. 1(c)). A rollback-request message is then broadcast to roll back each
processor according to the recovery line (Fig. 1(d)).

There are two primary checkpoint consistency situations. In Fig. 3(a), the checkpoints $CP_{i,k}$
and $CP_{j,m}$ are inconsistent because of the orphan message [31] $M_a$. In Fig. 3(b), $CP_{i,k}$ and $CP_{j,m}$
can become consistent if the channel-state message [24] $M_b$ is properly recorded. In this paper, we
assume either every message is synchronously logged^2 [12, 14] or an end-to-end transmission protocol
can guarantee the redelivery of the lost channel-state messages [28]. Therefore, checkpoints like
$CP_{i,k}$ and $CP_{j,m}$ in Fig. 3(b) are considered consistent.

3  Lazy  Checkpoint  Coordination 


3.1  Motivation 


We  will  refer  to  the  checkpoints  initiated  independently  by  each  processor  as  basic  checkpoints  and 
those  triggered  by  the  communication  as  induced  checkpoints.  Fig.  4(a)  illustrates  the  situation 

^2 Discussions on incorporating an asynchronous logging protocol into the independent checkpointing scheme
described in this section can be found in [3].
Figure 1: Checkpointing and rollback recovery (a) the checkpoint and communication pattern (b)
checkpoint graph for garbage collection (c) extended checkpoint graph when $p_0$ initiates the rollback
(d) checkpoint graph after recovery.


/* CP stands for checkpoint */
/* Initially, all the CPs are unmarked */
Include the latest CP of each processor in the root set;
Mark all CPs strictly reachable from any CP in the root set;
While (at least one CP in the root set is marked) {
    Replace each marked CP in the root set by the latest unmarked CP on the same processor;
    Mark all CPs strictly reachable from any CP in the root set;
}
The root set is the recovery line.


Figure  2:  The  rollback  propagation  algorithm. 
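The algorithm of Fig. 2 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation. The edge convention used in the example is an assumption made here (the precise definition of the checkpoint graph is in [1] and is not reproduced in this paper): an edge runs from each checkpoint to the next one on the same processor, and from the checkpoint that starts a sender's interval to the first receiver checkpoint taken after the message was processed. The function names `strictly_reachable` and `recovery_line` are hypothetical.

```python
def strictly_reachable(graph, roots):
    """Return every vertex reachable from some root by a path of length >= 1."""
    seen = set()
    stack = [v for r in roots for v in graph.get(r, ())]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, ()))
    return seen


def recovery_line(graph, latest):
    """Rollback propagation as in Fig. 2.

    graph  : dict mapping a checkpoint (pid, k) to a list of successor checkpoints
    latest : dict mapping pid to the index of its latest checkpoint
    Returns a dict mapping pid to the index of its checkpoint on the recovery line.
    """
    root = dict(latest)
    marked = strictly_reachable(graph, list(root.items()))
    while any((p, k) in marked for p, k in root.items()):
        for p in list(root):
            k = root[p]
            # Step back to the latest unmarked checkpoint on the same processor.
            while (p, k) in marked:
                k -= 1
                if k < 0:
                    raise ValueError(f"no unmarked checkpoint left on processor {p}")
            root[p] = k
        marked |= strictly_reachable(graph, list(root.items()))
    return root


# Example with two processors. Assumed message pattern:
#   p1 sends after (1,1), p0 processes it before taking (0,1)  -> edge (1,1) -> (0,1)
#   p0 sends after (0,1), p1 processes it before taking (1,2)  -> edge (0,1) -> (1,2)
example_graph = {
    (0, 0): [(0, 1)],
    (0, 1): [(1, 2)],
    (1, 0): [(1, 1)],
    (1, 1): [(1, 2), (0, 1)],
    (1, 2): [],
}
line = recovery_line(example_graph, {0: 1, 1: 2})
```

In the example, $(1,2)$ is marked first, its replacement $(1,1)$ then causes $(0,1)$ to be marked, and the search settles on the line formed by $(0,0)$ and $(1,1)$.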
Figure 3: Checkpoint consistency (a) orphan message (b) channel-state message.


where the communication pattern renders most of the basic checkpoints useless for rollback recovery
and the only recovery line is at the very beginning of the execution. A straightforward way of
avoiding such unbounded rollback propagation is to perform eager checkpoint coordination as shown
in Fig. 4(b). Whenever a processor initiates a basic checkpoint, coordination messages (dotted
arrows) are broadcast to all other processors to request the cooperation in making a consistent set
of checkpoints [23]. Let $B$ be the total number of basic checkpoints and $I$ be the total number of
induced checkpoints. We define the induction ratio $\mathcal{R}$ as

$$\mathcal{R} = \frac{I}{B} \qquad (1)$$

which is a measure of the overhead for performing communication-induced checkpoint coordination.
Clearly, eager checkpoint coordination has $\mathcal{R} = N - 1$ and will result in large run-time overhead
when $N$ is large. In addition, the $N - 1$ coordination messages per checkpoint session constitute
another overhead.

The large overhead of eager checkpoint coordination results from its pessimistic nature. More
specifically, when $p_1$ in Fig. 4(b) initiates its first basic checkpoint $b_{1,1}$^3, it "pessimistically" assumes
that messages like $M_1$ will exist in the future and cause $b_{1,1}$ to be inconsistent with its corresponding
checkpoint $b_{0,1}$ on $p_0$. In order to guarantee $b_{1,1}$ belongs to a useful recovery line, $p_1$ "eagerly"
requests $p_0$'s cooperation at the time $b_{1,1}$ is initiated. In contrast, lazy checkpoint coordination
adopts an optimistic approach by assuming that $b_{0,1}$ will be consistent with $b_{1,1}$. If the assumption
turns out to be true, no explicit coordination is necessary. An extra checkpoint will be induced on $p_0$
only when the message $M_1$ indicates that the assumption has failed (Fig. 4(c)). From another point
of view, such a scheme "lazily" delays the broadcast of the coordination messages and implicitly
piggybacks them on future normal messages. Both checkpoint and message overhead can therefore
be reduced.

^3 $b_{i,k}$ denotes the $k$th basic checkpoint of $p_i$ and $CP_{i,k}$ denotes the $k$th checkpoint of $p_i$.

Figure 4: Communication-induced checkpointing (a) the checkpoint and communication pattern (b)
eager checkpoint coordination (c) lazy checkpoint coordination with laziness = 1 (d) lazy checkpoint
coordination with laziness = 2.

However, given a basic checkpoint pattern, the number of induced checkpoints in the above
scheme is determined by the communication pattern and is not otherwise controllable. In the worst
case, the induction ratio $\mathcal{R}$ can still be $N - 1$ as illustrated in Fig. 4(c). In order to further reduce
the overhead, we can perform even "lazier" coordination by only enforcing the consistency between
checkpoints $CP_{0,nZ}$ and $CP_{1,nZ}$, where $Z$ is called the laziness and $n$ is an integer. Fig. 4(d) shows the
case with $Z = 2$. No checkpoint is induced until the message $M_2$ indicates the inconsistency between
$b_{1,2}$ and $b_{0,2}$. The number of induced checkpoints can be reduced from 8 (Fig. 4(c) with $Z = 1$) to 2 at
the cost of potentially larger rollback distance. It becomes clear that lazy checkpoint coordination
can provide a trade-off between the checkpointing overhead and average rollback distance.

3.2  The  Protocol 

Our approach is to incorporate the lazy checkpoint coordination into the independent
checkpointing scheme as a mechanism for bounding rollback propagation. Therefore, the checkpointing and
rollback recovery protocol can be built on top of the one described in Section 2. During
normal execution, each processor still takes its basic checkpoints independently. The laziness $Z$ is a
predetermined system parameter known to all processors. Suppose a processor $p_j$ with
current checkpoint ordinal number $r$ is about to process a message $M$ with sender $p_i$'s ordinal
number $s$. If $p_j$ detects the following condition to be true

$$l = \lfloor s/Z \rfloor > \lfloor r/Z \rfloor,$$

it realizes that $CP_{i,lZ}$ and $CP_{j,lZ}$ will be inconsistent unless an extra checkpoint is induced before
$M$ is processed. We describe a possible implementation as follows. Each processor $p_j$ maintains a
variable $V$ which is initialized to be $Z$ and incremented by $Z$ each time $CP_{j,nZ}$ is taken. Before $p_j$
processes a message $M$ with $s \ge V$, it is forced to take the checkpoint $CP_{j,lZ}$ and update its ordinal
number counter to $lZ$. In other words, if $M$ was sent after $CP_{i,lZ}$ was taken, it must be processed
by $p_j$ after $CP_{j,lZ}$ is induced. Notice that all checkpoints $CP_{j,m}$ with $r < m < lZ$ become dummy
checkpoints which overlap with $CP_{j,lZ}$.
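The implementation outlined above can be sketched in Python. This is an illustrative model only: the class name `LazyProcessor` and its counters are hypothetical, and the comparison against $V$ is taken here as $s \ge V$, which coincides with the floor condition $\lfloor s/Z \rfloor > \lfloor r/Z \rfloor$ when $V$ is the smallest multiple of $Z$ greater than $r$. The helper `_advance_V` also handles the case where an induced checkpoint skips several multiples of $Z$ at once, an assumption that goes slightly beyond the text.

```python
class LazyProcessor:
    """Toy model of lazy checkpoint coordination on one processor (illustrative)."""

    def __init__(self, pid, laziness):
        self.pid = pid
        self.Z = laziness
        self.ordinal = 0     # current checkpoint ordinal number r
        self.V = laziness    # smallest multiple of Z not yet reached
        self.basic = 0       # number of basic checkpoints taken
        self.induced = 0     # number of induced checkpoints taken

    def _advance_V(self):
        # Keep V at the smallest multiple of Z that is strictly greater than the ordinal.
        while self.V <= self.ordinal:
            self.V += self.Z

    def basic_checkpoint(self):
        self.ordinal += 1
        self.basic += 1
        self._advance_V()

    def send(self):
        # A message carries the sender's processor number and checkpoint ordinal s.
        return (self.pid, self.ordinal)

    def receive(self, message):
        """Return True if a checkpoint had to be induced before processing."""
        _sender, s = message
        if s >= self.V:
            l = s // self.Z
            # Ordinals strictly between the old r and l*Z become dummy checkpoints.
            self.ordinal = l * self.Z
            self.induced += 1
            self._advance_V()
            return True
        return False
```

With $Z = 2$, a receiver at ordinal 0 processes a message tagged $s = 1$ without any extra checkpoint, whereas a message tagged $s = 2$ forces it to take $CP_{j,2}$ first. The induction ratio of a simulated run is then the sum of `induced` over all processors divided by the sum of `basic`.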

In addition to the centralized garbage collection algorithm [37], a simple distributed algorithm
can also be used for low-cost garbage collection. The basic idea is that if the current checkpoint
ordinal number of every processor has exceeded $nZ$, all the checkpoints $CP_{j,m}$ with $m < nZ$
become obsolete with respect to the recovery line consisting of $\{CP_{i,nZ} : 0 \le i \le N-1\}$ and
therefore can be discarded. Each processor $p_j$ needs to maintain an array CP_progress[N] which
records the highest ordinal number for every other processor known to $p_j$ based on the information
included in each message. More efficient garbage collection can be achieved by piggybacking the
CP_progress[N] array on the normal messages periodically in order to maintain the "transitive"
knowledge of checkpointing progress of each processor [38].
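A minimal Python sketch of this distributed scheme is given below. The function names `update_progress` and `obsolete_below` are hypothetical, the progress array is assumed to include the local processor's own ordinal, and "exceeded" is read literally as a strict inequality.

```python
def update_progress(cp_progress, sender, sender_ordinal, piggybacked=None):
    """Merge the knowledge carried by one incoming message into the local array.

    cp_progress    : list of length N, highest ordinal known for each processor
    sender         : processor number tagged on the message
    sender_ordinal : checkpoint ordinal number tagged on the message
    piggybacked    : optional copy of the sender's own progress array
    """
    cp_progress[sender] = max(cp_progress[sender], sender_ordinal)
    if piggybacked is not None:
        # Transitive knowledge: keep the element-wise maximum of both arrays.
        for i, k in enumerate(piggybacked):
            cp_progress[i] = max(cp_progress[i], k)


def obsolete_below(cp_progress, Z):
    """Return nZ for the largest n such that every known ordinal exceeds nZ.

    Local checkpoints with ordinal m < nZ can then be discarded.
    """
    lowest = min(cp_progress)
    n = max((lowest - 1) // Z, 0)
    return n * Z
```

For example, with $Z = 2$ and known ordinals 4, 5 and 7, every processor has passed ordinal 2 but not 4, so only checkpoints with ordinal below 2 are reported as obsolete.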

Although the set of checkpoints $\{CP_{i,nZ} : 0 \le i \le N-1\}$ always forms a recovery line, the
two-phase recovery procedure described in Section 2 should still be used to search for the most
recent recovery line in order to minimize the number of rolled-back processors and the rollback
distance. One possible optimization is that the dependency information corresponding to the
garbage checkpoints, as determined based on the CP_progress[N] array, need not be collected, thus
reducing the size of the responses to the rollback-initiating message and the time for constructing
the checkpoint graph.


4  Overhead  Analysis 


Since the checkpoint overhead of the lazy checkpoint coordination scheme depends on the
run-time dynamic communication pattern, it is important to analyze and estimate the potential extra
overhead resulting from the induced checkpoints. We will first show that, without any constraints
on the relative checkpointing progress of each processor, the worst-case induction ratio is $(N-1)/Z$.
Under certain conditions which are typically met by real applications, however, the upper bound on
the induction ratio can be shown to be independent of $N$.

4.1  Worst-Case  Analysis 

Our  approach  to  worst-case  analysis  consists  of  two  steps.  First,  given  any  fixed  basic  checkpoint 
pattern,  we  construct  the  worst-case  communication  pattern.  Secondly,  given  any  system  with  N 
processors,  we  derive  the  worst-case  induction  ratio  as  a  function  of  N  and  the  laziness  Z. 
In this section, we assume each checkpoint $CP_{i,k}$ is associated with a global time stamp
$t(CP_{i,k})$^4. For any checkpoint and communication pattern $\mathcal{P}$^5, define $CP^{\mathcal{P}}_{*,nZ} = CP^{\mathcal{P}}_{m,nZ}$ if
$t(CP^{\mathcal{P}}_{m,nZ}) \le t(CP^{\mathcal{P}}_{i,nZ})$ for all $0 \le i \le N-1$, i.e., $CP^{\mathcal{P}}_{*,nZ}$ denotes the earliest checkpoint #$nZ$
among all processors. Given any basic checkpoint pattern and the laziness $Z$, we construct the
communication pattern $\mathcal{P}_0$ as follows. Suppose $CP^{\mathcal{P}_0}_{*,nZ} = CP^{\mathcal{P}_0}_{i,nZ}$. Then $p_i$ sends a message to every
other processor $p_j$ and induces $CP^{\mathcal{P}_0}_{j,nZ}$ with $t(CP^{\mathcal{P}_0}_{j,nZ}) \approx t(CP^{\mathcal{P}_0}_{i,nZ})$ on processor $p_j$. Fig. 5(a)
shows an example of $\mathcal{P}_0$ with $Z = 2$. We will call the interval between $t(CP^{\mathcal{P}_0}_{*,(n-1)Z})$ and
$t(CP^{\mathcal{P}_0}_{*,nZ})$ induction session #$n$, which includes all the induced checkpoints $CP^{\mathcal{P}_0}_{j,nZ}$. The
following lemma will be used to prove $\mathcal{P}_0$ is the worst-case communication pattern in terms of the
induction ratio.

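LEMMA 1 Given a basic checkpoint pattern, we have $t(CP^{\mathcal{P}_0}_{*,nZ}) \le t(CP^{\mathcal{P}}_{*,nZ})$ for an arbitrary
communication pattern $\mathcal{P}$ and any positive integer $n$.

Proof. The proof is given by induction on $n$. Since there can not be any induced checkpoint
before $t(CP^{\mathcal{P}}_{*,Z})$, $t(CP^{\mathcal{P}}_{*,Z})$ only depends on the progress of taking basic checkpoints.
Therefore, $t(CP^{\mathcal{P}_0}_{*,Z}) = t(CP^{\mathcal{P}}_{*,Z})$ and the case $n = 1$ is true. For the case $n = k$, suppose
$CP^{\mathcal{P}}_{*,kZ} = CP^{\mathcal{P}}_{i,kZ}$. All checkpoints $CP^{\mathcal{P}}_{i,l}$ with $(k-1)Z < l \le kZ$ must be basic checkpoints because they
can not be induced before $t(CP^{\mathcal{P}}_{*,kZ})$. Also, $t(CP^{\mathcal{P}}_{*,(k-1)Z}) \le t(CP^{\mathcal{P}}_{i,(k-1)Z})$
by definition. Suppose the case $n = k-1$ is true, i.e., $t(CP^{\mathcal{P}_0}_{*,(k-1)Z}) \le t(CP^{\mathcal{P}}_{*,(k-1)Z})$. We then have
$CP^{\mathcal{P}}_{i,kZ} = CP^{\mathcal{P}_0}_{i,q}$ where $q \ge kZ$ because $t(CP^{\mathcal{P}_0}_{i,(k-1)Z}) \approx t(CP^{\mathcal{P}_0}_{*,(k-1)Z})$ by construction and there are
at least $Z$ basic checkpoints of $p_i$, i.e., the $CP^{\mathcal{P}}_{i,l}$'s, between $t(CP^{\mathcal{P}_0}_{i,(k-1)Z})$ and $t(CP^{\mathcal{P}}_{i,kZ})$. Finally,

$$t(CP^{\mathcal{P}_0}_{*,kZ}) \le t(CP^{\mathcal{P}_0}_{i,kZ}) \le t(CP^{\mathcal{P}_0}_{i,q}) = t(CP^{\mathcal{P}}_{i,kZ}) = t(CP^{\mathcal{P}}_{*,kZ})$$

and we have proved $t(CP^{\mathcal{P}_0}_{*,nZ}) \le t(CP^{\mathcal{P}}_{*,nZ})$ for all positive integers $n$. □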

LEMMA 2 Given a basic checkpoint pattern, $\mathcal{P}_0$ is the worst-case communication pattern resulting
in the largest induction ratio.

^4 This is only for the purpose of presentation.

^5 We will use $CP^{\mathcal{P}}_{i,k}$ to denote the $k$th checkpoint of $p_i$ in the checkpoint and communication pattern $\mathcal{P}$. When it is
clear from the context that the basic checkpoint pattern is fixed, we also use the same notation for the communication
pattern $\mathcal{P}$.
Figure 5: (a) Worst-case communication pattern (b) worst-case checkpoint and communication
pattern.

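Proof. Let $I^{\mathcal{P}}_n$ denote the total number of induced checkpoints with ordinal number $nZ$ for
the communication pattern $\mathcal{P}$, and let $q = \max\{n : I^{\mathcal{P}}_n \ne 0\}$ be the maximum among the $n$'s such
that $I^{\mathcal{P}}_n \ne 0$. Clearly, $I^{\mathcal{P}}_n \le N - 1$. Since $t(CP^{\mathcal{P}_0}_{*,qZ}) \le t(CP^{\mathcal{P}}_{*,qZ})$ by Lemma 1, the checkpoint and
communication pattern with $\mathcal{P}_0$ must consist of at least $q$ induction sessions. Let $I^{\mathcal{P}}$ denote the
total number of induced checkpoints for $\mathcal{P}$; we then have

$$I^{\mathcal{P}} = \sum_{1 \le n \le q} I^{\mathcal{P}}_n \le \sum_{1 \le n \le q} (N-1) \le I^{\mathcal{P}_0}.$$

Finally, because the number of basic checkpoints is fixed by the given basic checkpoint pattern, $\mathcal{P}_0$
has the largest induction ratio among all possible communication patterns. □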

Lemma 2 states that, for worst-case analysis of the induction ratio, we need only consider the
communication pattern $\mathcal{P}_0$ for each basic checkpoint pattern. Because the induction sessions are
well-defined in such patterns (as shown in Fig. 5), the derivations can be simplified.


THEOREM 1 For any system with $N$ processors and laziness $Z$, the induction ratio

$$\mathcal{R} \le \frac{N-1}{Z}.$$

Proof. For any basic checkpoint pattern with its corresponding $\mathcal{P}_0$ which results in $L$ complete
induction sessions, the number of induced checkpoints is $L \cdot (N-1)$. Let $B_n$ denote the number of
basic checkpoints within the induction session #$n$; we have $B_n \ge Z$ for all $1 \le n \le L$ because the $Z$
checkpoints $CP^{\mathcal{P}_0}_{i,l}$ with $(n-1)Z < l \le nZ$ can not be induced checkpoints if $CP^{\mathcal{P}_0}_{*,nZ} = CP^{\mathcal{P}_0}_{i,nZ}$.
Therefore, the induction ratio

$$\mathcal{R} = \frac{L \cdot (N-1)}{\sum_{1 \le n \le L} B_n + B_{L+1}} \le \frac{L \cdot (N-1)}{L \cdot Z} = \frac{N-1}{Z}.$$

□

Fig. 5(b) shows an example of the worst case for $N = 3$ and $Z = 2$. The stacked checkpoints
indicate the fact that each dummy checkpoint $CP^{\mathcal{P}_0}_{j,2n-1}$ overlaps with the induced checkpoint $CP^{\mathcal{P}_0}_{j,2n}$.
Since it takes exactly $Z = 2$ basic checkpoints to induce every $N - 1 = 2$ checkpoints, the induction
ratio is $(N-1)/Z = 1$.

4.2  The  Upper  Bound  under  Constraints 

The upper bound in Theorem 1 was derived under no constraints on the program behavior. Since
it is of order $O(N)$, the induction ratio may be unacceptably high for systems with a large number
of processors. However, a closer look at the two patterns in Fig. 5 reveals that the situation in (b),
which results in the worst-case induction ratio, is less likely to happen for real applications where
the processors typically regularize their paces in taking basic checkpoints, as shown in (a). For
example  in  Fig.  5(b),  it  is  very  likely  for  po  to  take  at  least  one  basic  checkpoint  between  CP^l 
and  CPJ’q.  We  can  show  that  under  the  following  constraints  which  are  usually  satisfied  in  real 
applications,  the  upper  bound  on  the  induction  ratio  is  independent  of  N . 

Constraint 1: Let $Q$ denote the ratio of the maximum to the minimum length of the basic
checkpoint interval. Although each processor is allowed to take its basic checkpoints at its own pace,
$Q$ is typically bounded by a small constant. For example, $Q$ is 2 or 3 for our experiments
described in the next section.

Constraint 2: Only the cases with $Z \ge 2$ will be considered for refined upper bounds because the
worst case for $Z = 1$ is always achievable even when $Q$ is small (see Fig. 4(c)).

Constraint 3: The applications employing checkpointing and rollback recovery are usually
long-running jobs, which implies $Z \cdot L$ is quite large. (Recall $L$ is the number of complete induction
sessions with $\mathcal{P}_0$.) In particular, we assume $Z \cdot L \gg \lceil Q \rceil$.

THEOREM 2 Under the above constraints, the induction ratio $\mathcal{R} \le \lceil Q \rceil$.

Proof. Again we only have to consider $\mathcal{P}_0$ for each basic checkpoint pattern for the worst case.
Let $M$ denote the smallest integer such that $M \cdot (Z-1) \ge Q$. Since $Z \ge 2$ by Constraint 2, we have
$M \le \lceil Q \rceil$. We define an $M$-induction session as consisting of $M$ consecutive induction sessions.
There are then $L_M = \lfloor L/M \rfloor$ complete $M$-induction sessions, each containing $M \cdot (N-1)$ induced
checkpoints. We consider the following two cases.

(a) $N \le M$: By Theorem 1,

$$\mathcal{R} \le \frac{N-1}{Z} < M \le \lceil Q \rceil. \qquad (2)$$

(b) $N > M$: First we consider the number of induced checkpoints $I$. If $Z \ge Q + 1$, then $M = 1$
and $I = L \cdot (N-1)$. If $Z < Q + 1$, $Z \cdot L \gg \lceil Q \rceil$ in Constraint 3 implies $L \gg \lceil Q \rceil$. Since
$M \le \lceil Q \rceil$, we have $L/M \gg 1$ and

$$I = L_M \cdot M \cdot (N-1) + \sum_{L_M M + 1 \le k \le L} I_k \approx L_M \cdot M \cdot (N-1).$$

In either case, $I \approx L_M \cdot M \cdot (N-1)$.

Now consider the number of basic checkpoints $B$. For each induction session #$n$, the processor
$p_i$ with $CP^{\mathcal{P}_0}_{*,nZ} = CP^{\mathcal{P}_0}_{i,nZ}$ must contribute $Z$ basic checkpoints and therefore the length of
each induction session is at least $Z - 1$ basic checkpoint intervals. Within each $M$-induction
session, at least $N - M$ processors do not have $CP^{\mathcal{P}_0}_{*,nZ} = CP^{\mathcal{P}_0}_{i,nZ}$. By the definition
of $Q$, these $N - M$ processors must each contribute at least $\lfloor M(Z-1)/Q \rfloor$ basic checkpoints.
Therefore,

$$B \ge L_M \cdot \left( M \cdot Z + (N-M) \cdot \left\lfloor \frac{M(Z-1)}{Q} \right\rfloor \right)$$

and

$$\mathcal{R} = \frac{I}{B} \le \frac{M \cdot (N-1)}{M \cdot Z + (N-M) \cdot \left\lfloor \frac{M(Z-1)}{Q} \right\rfloor}. \qquad (3)$$

Since $Z \ge 1$ and $\lfloor M(Z-1)/Q \rfloor \ge 1$ by definition, we have

$$\mathcal{R} \le \frac{M \cdot (N-1)}{M + (N-M)} \le M \le \lceil Q \rceil. \qquad (4)$$

□


Notice that Eqs. (3) and (4) are still valid if we replace $M$ with any $m$ such that $M \le m \le \lceil Q \rceil$.
By combining Theorem 1, Eq. (2) and Eq. (3), we then define the refined upper bound, called the
$Q$-bound, as follows.

$$Q\text{-bound} = \min_{M \le m \le \lceil Q \rceil} \frac{m \cdot (N-1)}{m \cdot Z + [N > m] \cdot (N-m) \cdot \left\lfloor \frac{m(Z-1)}{Q} \right\rfloor}$$

where $[N > m] = 1$ if $N > m$ is true and 0 otherwise.

Fig. 6(a) compares the worst-case induction ratio with the $Q$-bound where $Q = 2$ for $N = 8$, 16
and 32. While the worst-case ratio $(N-1)/Z$ clearly grows with $N$, the $Q$-bound is relatively
insensitive to $N$. Fig. 6(b) compares the worst-case induction ratio, which is equivalent to the
$Q$-bound with $Q = \infty$, with the $Q$-bound where $Q$ varies from 2 to 5. Since our purpose of
introducing the $Q$-bound is to estimate the induction ratio for real applications in advance, the
insensitivity of the $Q$-bound to the exact value of $Q$ suggests that an approximate value of $Q$ suffices
for the estimation. Finally, notice that if $Z$ is chosen to be at least $Q + 1$, we have $\mathcal{R} \le M = 1$,
i.e., the number of induced checkpoints will never exceed the number of basic checkpoints.
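The quantities used in Theorems 1 and 2 are easy to evaluate numerically. The Python sketch below computes the worst-case ratio $(N-1)/Z$, the integer $M$ from the proof of Theorem 2 (taken here as the smallest integer with $M(Z-1) \ge Q$), and the bound $\lceil Q \rceil$. The function names are hypothetical, and the refined $Q$-bound itself is deliberately not coded here.

```python
import math


def worst_case_ratio(N, Z):
    """Theorem 1: the induction ratio never exceeds (N - 1) / Z."""
    return (N - 1) / Z


def smallest_M(Z, Q):
    """Smallest integer M with M * (Z - 1) >= Q; requires Z >= 2 (Constraint 2)."""
    if Z < 2:
        raise ValueError("Constraint 2 requires Z >= 2")
    return max(1, math.ceil(Q / (Z - 1)))


def theorem2_bound(Q):
    """Theorem 2: under Constraints 1 to 3 the induction ratio is at most ceil(Q)."""
    return math.ceil(Q)
```

For instance, with $N = 32$ and $Z = 2$ the unconstrained worst case is 15.5, whereas $Q = 2$ gives a bound of 2 that does not depend on $N$. Choosing $Z = 3 \ge Q + 1$ yields $M = 1$, matching the closing remark above.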


5  Experimental  Results 


Four parallel programs written in the Chare Kernel language are used for the
communication trace-driven simulation. The Chare Kernel has been developed as a medium-grain,
machine-independent parallel language [39]. Programs written in the Chare Kernel language can run
unchanged on both shared-memory and distributed-memory machines such as Encore Multimax,
Sequent Symmetry, Intel iPSC/2 and i860 hypercubes and a network of Sun workstations. Program
traces used in this paper are collected from an Encore Multimax 510.

Figure 6: (a) Worst-case induction ratio and the $Q$-bounds ($Q = 2$) for various $N$ (b) worst-case
induction ratio ($N = 32$) and the $Q$-bounds for various $Q$.

The four programs include two newly developed CAD applications, Test generation and Logic
synthesis, and two search applications, Knight tour and N queen. The execution times are between
25 and 45 minutes (see Table 1). The total number of messages ranges from tens to hundreds of
thousands. Our simulation uses the following scheme for inserting checkpoints. The predetermined
minimum basic checkpoint interval is chosen to be 2 minutes. A variable Next_CP_Time is initialized
to 2 minutes. Each processor checks its local clock after processing every 100 messages. If the clock
time exceeds Next_CP_Time, a basic checkpoint is inserted and Next_CP_Time is incremented by
2 minutes. The resulting average basic checkpoint interval (CPI) for each program is listed in
Table 1. Before processing each message, the processor also checks if an induced checkpoint and
the corresponding update of the ordinal number counter are necessary, as described in Section 3.
All reported numbers are averaged over five runs.
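The checkpoint insertion scheme can be sketched as a short Python function. This is an illustrative reading of the description above, not the simulator used for the experiments: the function name `basic_checkpoint_times` and the input format (a list of local clock readings, in seconds, taken when each message has been processed) are assumptions.

```python
def basic_checkpoint_times(message_times, interval=120.0, check_every=100):
    """Return the clock times at which basic checkpoints are inserted.

    message_times : local clock reading (seconds) after each processed message
    interval      : increment applied to Next_CP_Time (2 minutes in the paper)
    check_every   : the clock is checked after every `check_every` messages
    """
    next_cp_time = interval
    checkpoints = []
    for count, now in enumerate(message_times, start=1):
        # The clock is only examined after every check_every-th message.
        if count % check_every == 0 and now > next_cp_time:
            checkpoints.append(now)
            next_cp_time += interval
    return checkpoints


# Example: one message processed every 1.5 seconds, 400 messages in total.
times = [1.5 * i for i in range(1, 401)]
cps = basic_checkpoint_times(times)
```

Because the clock is only checked after every 100 messages, the actual interval between basic checkpoints depends on the message rate, which is consistent with the average basic CPIs in Table 1 differing from the nominal 2 minutes.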

Table 1: Execution and checkpoint parameters of the Chare Kernel programs.

Programs                          Test generation   Logic synthesis   Knight tour   N queen
Number of processors              8                 6                 8             6
Execution time (sec)              2,076             1,736             2,436         1,567
Number of messages                28,219            411,733           104,170       25,880
Average number of basic
  checkpoints per processor       12.6              11.8              18.0          10.5
Average basic CPI (sec)           158               140               132           139
Q                                 2.17              2.48              1.42          1.55
Under-2 percentage                99.6%             97.0%             100%          100%

We expect the variation of the basic checkpoint interval to be small because of the way it is
maintained. In particular, we choose $Q = 2$ to estimate the induction ratio. The exact value of
$Q$ for each program is listed in Table 1. Although $Q$ is slightly greater than 2 for the first two
programs, the numbers listed in the row of "Under-2 percentage" show that a very high percentage
of the basic checkpoint intervals are covered by $Q = 2$. Also, since the $Q$-bound is insensitive to
the exact value of $Q$, $Q = 2$ should suffice for our purpose. Fig. 7 plots the $Q$-bounds against the
worst-case and the actual induction ratios for the four programs. It is shown that the $Q$-bound
provides a good estimation of the induction ratio for real applications. The large difference in the
ratio between $Z = 1$ and $Z \ge 2$ confirms that our generalization of the idea of
communication-induced checkpoint coordination as described in [35] can significantly reduce the extra checkpoint
overhead.

Figure 7: Checkpoint coordination overhead as a function of laziness (a) Test generation (b) Logic
synthesis (c) Knight tour (d) N queen.

Figure 8: Average rollback distance as a function of the laziness.

Fig. 8 gives the average rollback distances in terms of the number of average basic CPIs. The
almost linear behavior can be explained as follows. Every $N$ basic checkpoints $b_{i,k}$, $0 \le i < N$,
are taken at approximately the same time $t_k$. If any one of them, say $b_{i,k}$, is $CP_{i,nZ}$, then either $b_{i,k}$
is consistent with $b_{j,k}$ or $CP_{j,nZ}$ is induced shortly afterwards due to the relatively large number of messages.
Hence, a recovery line is formed around $t_k$. For $Z = 1$, that means the average rollback distance is
at most 0.5 basic CPI and the exact value will depend on the offset between the $b_{i,k}$'s at run-time. For
$Z \ge 2$, as long as some $CP_{j,nZ}$'s are induced before the $b_{j,k}$'s are initiated, the $b_{j,k}$'s become $CP_{j,nZ+1}$'s and
one of the $b_{j,k+(Z-1)}$'s will become $CP_{j,(n+1)Z}$, which means a new recovery line will very likely exist
around $t_{k+(Z-1)}$. Therefore, the average rollback distance is approximately $(Z-1)/2$ basic CPIs,
as shown by the curve named "Estimated" in Fig. 8. It becomes clear that Figs. 7 and 8 provide a
flexible trade-off between run-time overhead and recovery efficiency.


6  Concluding  Remarks 

We have proposed the technique of lazy checkpoint coordination and incorporated it into the
independent checkpointing protocol as a mechanism for bounding rollback propagation. The recovery
line is guaranteed to move forward by performing communication-induced checkpoint coordination
only when the predetermined consistency criterion is about to be violated. The notion of laziness
was introduced to provide the trade-off between extra checkpoint overhead during normal execution
and the average rollback distance for recovery. Overhead analysis shows that the upper bound on
the induction ratio, i.e., the number of induced checkpoints divided by the number of basic
checkpoints, is related to the maximum ratio between the basic checkpoint intervals. Communication
trace-driven simulation results for several parallel programs showed that our analysis can provide a
good estimation for the induction ratio, and lazy checkpoint coordination can significantly reduce
the extra checkpoint overhead for real applications.